Tracing Individual Public Transport Customers from an Anonymous Transaction Database
Author
Abstract
Data mining concepts are used frequently throughout the transportation research sector. This article examines the market basket technique as a means of gaining more insight into the demands of public transport users. It proposes a method that uses various data attributes of passenger records to recognize the same customer in a different week (i.e., it attempts to track the same customer from week to week). The general idea behind the measure is that if two records are considered similar, ideally every trip in one customer record should have a close counterpart in the other record. The research develops a similarity function designed to maximize the percentage of positive ticket identifications over a number of weeks. Once similarity has been established, customer travel patterns can help the operator identify new routes and new timetables and make strategic decisions about satisfying public transport customer demands.
Journal of Public Transportation, Vol. 9, No. 4, 2006
Introduction
This study responds to a suggestion from McCarthy ( 00 ), who argued that regular customers of a supermarket might be recognizable from the patterns of their choices registered in Electronic Point of Sale (EPOS) data, and that this would help determine their long-term histories and behaviors (Chen et al. 2004). Obviously, this depends on the range of options available to each customer and on the total number of customers (e.g., for a fast food restaurant offering five types of sandwiches and five types of drinks and serving ,000 persons daily, it may be difficult to recognize a person by his or her pattern of choices). An attempt is made here to trace individual customers from an anonymous transaction database. The aim is to infer relations of passenger behavior that have not been noticed, or at least have not been confirmed, previously.
Finding potential relationships among entities that are not directly represented in the data is considered as important as finding relationships among entities that are directly represented. For example, can the travel patterns of bus passengers tell us about their work routine, shopping, or spare-time behavior? Mahmassani (1997) elaborates on the importance of the dynamics of commuter behavior and provides an overview, focusing on day-to-day dynamics. The main focus of this article is to develop a method that facilitates finding record sets of routine passengers, which can then be used to further analyze passenger behavior and dynamics.
The article provides a brief background of the research project and elaborates on the dataset used as an input source. A novel method that measures similarity between passenger records is then introduced. Finally, the article presents the results of applying the method to a subset of the entire data source.
Overview
In this study, magnetic strip card tickets from a public transport operator are considered. The operator provides bus services in a medium-sized European city. Train services are provided by another organization within the same group of companies. There is a predominant arterial movement of public transport services toward the city center in the morning peak periods, satisfying a well-recognized demand, and out of the city in the evenings. The tickets are issued by the public group of companies, of which the operator is one. At the time of data collection for this research, there was no competition in the market, either from other bus companies or from other modes such as rail. This type of ticket is generally the primary source of passenger data (Boyle 1998).
The registration system used by the operator was manufactured by Wayfarer. A magnetic strip card reader at the entrance of a bus verifies a ticket; its serial number is copied into the internal memory of the device and then onto a magnetic tape.
Other events registered on the same tape are the start of a bus journey and arrival at a specific stage (stages are selected bus stops); random stops between stages are not registered. The date and time of day, type of ticket, and route number are registered along with these events. The data from every bus are then copied into the transactional database.
Although there are many different types of prepaid magnetic card tickets, we consider only weekly types (valid for a single week, starting on Sunday) and monthly types (valid for one calendar month). Within a week, a customer is not anonymous, because all trip records for the same customer carry the same ticket serial number. All such trips taken together are comparable to a basket of items bought from a supermarket in a single visit. In the next week, however, the same customer, who is expected to use the same type of prepaid card, will have a different serial number. The question is whether weekly ticket users from different weeks can be identified by analyzing their trip patterns.
The permanently stored data are segmented at the transactional level; that is, data are stored permanently for each passenger boarding (Furth 2000). This applies regardless of whether the passenger pays with cash or has a prepaid magnetic strip card. Each piece/attribute of data is recorded as a 0-character string stored in ASCII text form. Data for a single day varies from 3 to MB, depending on whether it is for a weekday, weekend, or public holiday. The file for each day averages roughly 74,000 prepaid ticket validations.
Monthly tickets, largely similar to weekly tickets, provide an opportunity to verify whatever techniques we propose for customer identification, because they retain the same serial number throughout the month.
Of course, we have to exclude weeks spanning two months, which leaves us with verification material for three (sometimes four) consecutive weeks. This article presents the results obtained for weekly portions of customer records for monthly ticket types, where the accuracy of the results can be evaluated from known customer identities. Evaluation of the same techniques for weekly types will be discussed as a separate problem.
The Data
The general course of processing is as follows. The “raw” data from a number of daily files are scanned sequentially. Dates and times of day contained in the records for the start of a bus journey and for the arrival at each stage are propagated to ticket records, along with the route number, direction number (“0” or “ ”), and stage number (unique bus stop ID). Some corrupt data can be rejected at this stage. The type of every ticket is examined, and only tickets of selected types go on to further processing. The enhanced ticket records are then split by weeks, and, finally, the week files are sorted by customer numbers (the ticket type is treated as part of the customer number), so that they can be split into weekly records of individual customers.
A weekly record consists of a sequence of trips; each trip is documented by day of the week, time of day, route number, direction number, and stage number. There is no information about where the customer alighted. The average number of trips per week per passenger is approximately 3.
The next step is to convert the route and stage data to geographical coordinates, so that for any two trips we can see whether they started at nearby locations. Geographical coordinates of each stage (there are approximately ,000 stages) are known. From the direction of the trip (one of the two alternatives), we can derive the list of stages ahead of the boarding stage and approximate the intended direction of the customer.
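The splitting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trip` record, its field names, and the three helper functions are hypothetical. Only the rules come from the text (weeks start on Sunday, the ticket type is treated as part of the customer number, and weeks spanning two months are set aside).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass(frozen=True)
class Trip:
    """One enhanced ticket record (hypothetical field names)."""
    ticket_type: int
    serial: int
    boarded_at: datetime
    route: str
    direction: int
    stage: int


def week_start(day: date) -> date:
    """Return the Sunday that starts the week containing `day`."""
    # date.weekday() is 0 for Monday ... 6 for Sunday.
    return day - timedelta(days=(day.weekday() + 1) % 7)


def spans_two_months(start: date) -> bool:
    """True if the 7-day week beginning on `start` crosses a month boundary."""
    return (start + timedelta(days=6)).month != start.month


def weekly_baskets(trips):
    """Group trips into baskets keyed by (week start, ticket type, serial)."""
    baskets = defaultdict(list)
    for trip in trips:
        key = (week_start(trip.boarded_at.date()), trip.ticket_type, trip.serial)
        baskets[key].append(trip)
    # Keep each basket in chronological order.
    for basket in baskets.values():
        basket.sort(key=lambda t: t.boarded_at)
    return dict(baskets)
```

For monthly tickets, baskets whose week start satisfies `spans_two_months` would be dropped before verification, since the serial number changes at the month boundary.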
Some very short weekly records (three or fewer trips per week), as well as some considered corrupt (too quick a movement between geographically remote points), have been excluded. Thus, the objective of this study is to determine whether customers can be identified by their weekly “baskets,” each containing about 3 trips starting from a choice of about ,000 locations.
Figure 1 shows the contents of a sample basket from the week starting December , 1998, ticket type 9 (Weekly Student City zone) and ticket number 6197. The columns show the day of the week, time of departure, stage coordinates in meters, and stage name (typically, the error in the coordinates is within 0 meters, which is sufficient to enable identification of a particular bus stop). Figure 2 shows another basket, starting December 3, with the same ticket type 9 but a different ticket number, 6201. This basket was chosen to be similar to the basket displayed in Figure 1, and it is a plausible hypothesis that it was the same person in both cases. However, there is no way of verifying the hypothesis, because tickets of type 9 are valid only within a week. This is why the following discussion concentrates on monthly types, where the serial numbers provide a clue. Table 1 shows the ticket types considered. The numbers given for issued tickets represent the number of customers, after the filtering, for one sample week, starting September .
Figure 1. Sample Basket of Ticket # 6197 [13 trips; columns: day of week, time of departure, stage coordinates in meters, stage name. Stages boarded, in order: Stop A, B, A, C, D, E, F, G, H, G, A, B, I.]
Figure 2. Sample Basket of Ticket # 6201 [14 trips; same columns. Stages boarded, in order: Stop B, A, C, A, C, C, J, B, B, C, C, D, E, B.]
Table 1. Ticket Types [columns: Ticket Type, Description, Issued Tickets. Descriptions listed: Monthly Adult Short Hop Bus/Rail; Monthly Student Short Hop Bus/Rail; Monthly Adult City zone (Airings...); Monthly Adult Travelwide.]
Unfortunately, the most popular weekly ticket types have more customers (e.g., type 7 , Weekly Adult City zone, has about 9,000 customers weekly). Hence, customer identification for those types is far more difficult than in the cases with known answers.
Measuring Similarity Between Customer Records
The simplest idea for finding the same customer in a different week is to define a measure of similarity between two customer records and then to look for the best match for a specific customer record. The general idea behind the measure is that if two records are considered similar, ideally every trip in one customer record (denoted by R) should have a close counterpart in the other record (denoted by R′). The idea of identifying similarity between customers was used prior to this work in the retail sector, but this is the first time it has been applied to public transport magnetic ticket data and to public transport customers.
Of course, we then have to define which single trip is considered similar to which other trip. A trip is defined by its starting location, direction, and time of day (we ignore day of week for the time being), so the similarity of two trips should be defined in terms of the closeness of these components.
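One way to make "closeness of the components" concrete is to compute a separate gap for each component of a pair of trips. The sketch below is an assumption-laden illustration rather than the article's definition: the `TripStart` fields, the wrap-around of time of day at midnight, and the use of an angle between direction vectors are all hypothetical choices.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TripStart:
    """Starting point of a trip (hypothetical representation)."""
    x: float            # stage easting in meters
    y: float            # stage northing in meters
    minute_of_day: int  # boarding time, 0..1439
    dir_x: float        # estimated direction vector, x component
    dir_y: float        # estimated direction vector, y component


def location_gap(a: TripStart, b: TripStart) -> float:
    """Straight-line distance in meters between the two boarding stages."""
    return math.hypot(a.x - b.x, a.y - b.y)


def time_gap(a: TripStart, b: TripStart) -> int:
    """Difference in minutes between boarding times, wrapping at midnight."""
    diff = abs(a.minute_of_day - b.minute_of_day)
    return min(diff, 1440 - diff)


def direction_gap(a: TripStart, b: TripStart) -> float:
    """Angle in degrees (0..180) between the two direction vectors."""
    cross = a.dir_x * b.dir_y - a.dir_y * b.dir_x
    dot = a.dir_x * b.dir_x + a.dir_y * b.dir_y
    return math.degrees(math.atan2(abs(cross), dot))
```

Small gaps on all three components would indicate trips that could plausibly belong to the same routine; how the gaps are combined into a single value is a separate design decision.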
If the closeness were defined as a Boolean function with only two values, we could solve a discrete task of assigning to each trip in R a close trip in R′ (a sort of assignment problem). Using a fuzzy approach instead, each trip in R is matched with each trip in R′, producing a numeric value. This value will be high for similar trips and close to 0 for differing trips. If we add together the values for all pairs, only the pairs with a good match will contribute significantly to the sum. So, the higher the sum, the better the match.
The similarity function is defined in several stages:
• Defining the weight of a trip
• Estimating the direction vector of a trip
• Comparing two trips from different customer records
• Consolidating data per starting location
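The fuzzy all-pairs sum and the best-match search can be sketched as below. This is a simplified illustration, not the similarity function developed in the article: it leaves out trip weights, direction vectors, and consolidation per starting location, and the Gaussian-style kernel, the scale parameters, and all function names are assumptions made for the example.

```python
import math


def trip_closeness(a, b, dist_scale=200.0, time_scale=30.0):
    """Fuzzy closeness in (0, 1]: 1 for identical trips, near 0 for distant ones.

    Trips are (x_m, y_m, minute_of_day) tuples. The exponential kernel and
    both scale values are illustrative guesses, not taken from the article.
    """
    dist = math.hypot(a[0] - b[0], a[1] - b[1])
    dt = abs(a[2] - b[2])
    return math.exp(-(dist / dist_scale) ** 2 - (dt / time_scale) ** 2)


def record_similarity(r, r_prime, closeness=trip_closeness):
    """Sum the closeness of every trip in r with every trip in r_prime."""
    return sum(closeness(a, b) for a in r for b in r_prime)


def best_match(record, candidates, closeness=trip_closeness):
    """Return the key of the candidate record with the highest similarity.

    `candidates` maps a customer key to a list of trips and must be non-empty.
    """
    return max(
        candidates,
        key=lambda k: record_similarity(record, candidates[k], closeness),
    )
```

A possible usage: with `record` taken from one week and `candidates` holding all records of the same ticket type from another week, `best_match(record, candidates)` returns the key of the most similar record. For monthly tickets the returned key can be compared with the known serial number to check the identification.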
Publication date: 2006